These materials have been taken from the Software Carpentry: R Novice Lesson. You can find the original materials here


This lesson will cover some basic functions that can be used to manipulate data in R. Again, we will be using the gapminder data set, which includes country information on GDP, population, etc.


There are five main functions we’ll be talking about today, each allowing us to manipulate data frames. These five functions are:


If you haven’t already, make sure you have dplyr() and gapminder() installed and loaded with the following commands:

# Download the packages
# install.packages(c("tidyverse", "gapminder"))

# Load the packages for use
library(tidyverse)
gapminder <- read.csv("../data/gapminder_data.csv")


Let’s take a quick look at our data frame to remind ourselves of its structure. We do this using the head() command, which will display the first 6 rows of our data frame.

head(gapminder)
      country continent year lifeExp      pop gdpPercap
1 Afghanistan      Asia 1952  28.801  8425333  779.4453
2 Afghanistan      Asia 1957  30.332  9240934  820.8530
3 Afghanistan      Asia 1962  31.997 10267083  853.1007
4 Afghanistan      Asia 1967  34.020 11537966  836.1971
5 Afghanistan      Asia 1972  36.088 13079460  739.9811
6 Afghanistan      Asia 1977  38.438 14880372  786.1134


Quick aside on tibbles and data frames

In R, one of the main types of objects/variables we’re going to be working with is a data frame. This is much like a table you would view in Excel, where column represent variables or measures and rows represent measurements, samples, or observations.


Choose Columns: select

The first function we’ll be using is select(). This function let’s us pick columns from our data frame, based on name (e.g. year) or by index (e.g. 3).

Let’s try using select() to pick out a few columns: “country”, “year”, “lifeExp”, and “pop”. We’ll be assigning these columns to a new data frame, gapminder_select. Then we’ll use head() to see if it worked.

# select() code here
gapminder_select <- select(gapminder, country, year, lifeExp, pop)

# Check the data frame
head(gapminder_select)
      country year lifeExp      pop
1 Afghanistan 1952  28.801  8425333
2 Afghanistan 1957  30.332  9240934
3 Afghanistan 1962  31.997 10267083
4 Afghanistan 1967  34.020 11537966
5 Afghanistan 1972  36.088 13079460
6 Afghanistan 1977  38.438 14880372


As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the select() command.


Here we’ll also introduce another great feature of dplyr(): the pipe ( %>% ). This symbol sends or pipes an object (e.g. a data frame like gapminder) INTO a function (e.g. select()).

So, the above select() command can be rewritten as follows (NOTE: the “.” is a placeholder, which represents the object being piped). Again, we can check our result using head().

# select() using pipe syntax
gapminder_pipe <- gapminder %>% select(., country, year, lifeExp, pop)

head(gapminder_pipe)
      country year lifeExp      pop
1 Afghanistan 1952  28.801  8425333
2 Afghanistan 1957  30.332  9240934
3 Afghanistan 1962  31.997 10267083
4 Afghanistan 1967  34.020 11537966
5 Afghanistan 1972  36.088 13079460
6 Afghanistan 1977  38.438 14880372


We can actually simplify the above command further - dplyr’s functions such as select() are smart enough that you don’t actually need to include the “.” placeholder, as shown below.

# select() using pipe syntax w/out a placeholder
gapminder_pipe2 <- gapminder %>% select(country, year, lifeExp, pop)

head(gapminder_pipe2)
      country year lifeExp      pop
1 Afghanistan 1952  28.801  8425333
2 Afghanistan 1957  30.332  9240934
3 Afghanistan 1962  31.997 10267083
4 Afghanistan 1967  34.020 11537966
5 Afghanistan 1972  36.088 13079460
6 Afghanistan 1977  38.438 14880372


Challenge 1

Using the select() command and pipe (%>%) notation, pick the following columns from the gapminder data frame, assign them to a new variable (we’ll use x), and display the results using head(x, n = 10). Columns to choose are:

  • continent
  • GDP per capita
  • life expectancy
  • year
# Answer here:
x <- select()


Choose Rows: filter

So we’ve covered selecting columns, but what about rows? This is where filter() comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30.

NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (==).

Let’s go through a couple examples.

# Filter rows where country is Canada
gapminder_canada <- gapminder %>% filter(country == "Canada")

head(gapminder_canada)
  country continent year lifeExp      pop gdpPercap
1  Canada  Americas 1952   68.75 14785584  11367.16
2  Canada  Americas 1957   69.96 17010154  12489.95
3  Canada  Americas 1962   71.30 18985849  13462.49
4  Canada  Americas 1967   72.13 20819767  16076.59
5  Canada  Americas 1972   72.88 22284500  18970.57
6  Canada  Americas 1977   74.21 23796400  22090.88


Let’s try another one, this time filtering on life expectancy above a certain threshold:

# Filter for rows where life expectancy is greater than 50
gapminder_LE <- gapminder %>% filter(lifeExp > 50)

head(gapminder_LE)
  country continent year lifeExp     pop gdpPercap
1 Albania    Europe 1952   55.23 1282697  1601.056
2 Albania    Europe 1957   59.28 1476505  1942.284
3 Albania    Europe 1962   64.82 1728137  2312.889
4 Albania    Europe 1967   66.22 1984060  2760.197
5 Albania    Europe 1972   67.69 2263554  3313.422
6 Albania    Europe 1977   68.93 2509048  3533.004


We can also filter with multiple arguments, each separated by a comma:

# filter() for Canada and life expectancy greater than 80
gapminder_C_LE <- gapminder %>% filter(country == "Canada", lifeExp > 80)

head(gapminder_C_LE)
  country continent year lifeExp      pop gdpPercap
1  Canada  Americas 2007  80.653 33390141  36319.24


Challenge 2

Use filter() to choose data for African countries, from the year 1980 and onwards.

# Challenge 2 code here
x <- filter()


Create New Columns: mutate()

Let’s say we now want to calculate the GDP in billions, which is done by mutiplying the GDP per capita by the population, then dividing by 1 billion (1 * 10^9). mutate() will perform this calculation on each row in the data frame, one row at a time (i.e. row-wise). The code below will calculate the GDP in billions:

# Use mutate() to calculate GDP in billions
gapminder_gdpBil <- gapminder %>% mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_gdpBil)
      country continent year lifeExp      pop gdpPercap gdp_billion
1 Afghanistan      Asia 1952  28.801  8425333  779.4453    6.567086
2 Afghanistan      Asia 1957  30.332  9240934  820.8530    7.585449
3 Afghanistan      Asia 1962  31.997 10267083  853.1007    8.758856
4 Afghanistan      Asia 1967  34.020 11537966  836.1971    9.648014
5 Afghanistan      Asia 1972  36.088 13079460  739.9811    9.678553
6 Afghanistan      Asia 1977  38.438 14880372  786.1134   11.697659


Combine Functions with Pipes

We’ve seen that pipes ( %>% ) can be used to send an object such as a data frame into a function, such as select(), or filter(). But they can also be used to send the output of one function into another function. This allows us to chain together multiple commmands, without the need for intermediate variables.

Let’s take a look at this in an example.

# select() the five columns, and filter() for Canada
gapminder_multi <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada")

head(gapminder_multi)
  country year lifeExp      pop gdpPercap
1  Canada 1952   68.75 14785584  11367.16
2  Canada 1957   69.96 17010154  12489.95
3  Canada 1962   71.30 18985849  13462.49
4  Canada 1967   72.13 20819767  16076.59
5  Canada 1972   72.88 22284500  18970.57
6  Canada 1977   74.21 23796400  22090.88


We can further expand on this by incorporating our mutate() command from earlier, linking multiple functions into a single command. Be sure to indent (TAB key) when moving to a new line after a pipe.

# select() the four columns, filter() for Canada, and calculate GDP in billions
gapminder_multi_2 <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada") %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_multi_2)
  country year lifeExp      pop gdpPercap gdp_billion
1  Canada 1952   68.75 14785584  11367.16    168.0701
2  Canada 1957   69.96 17010154  12489.95    212.4560
3  Canada 1962   71.30 18985849  13462.49    255.5967
4  Canada 1967   72.13 20819767  16076.59    334.7108
5  Canada 1972   72.88 22284500  18970.57    422.7497
6  Canada 1977   74.21 23796400  22090.88    525.6835


Calculations with group_by and summarise

These functions allow us to work on our data in specific groups. For example, we can use group_by() to group observations by country, then calculate the average life expectancy for each country.

# group_by() country, calculate average life expectancy
gapminder_grp <- gapminder %>% 
group_by(country) %>% 
summarise(mean(lifeExp))

head(gapminder_grp)
# A tibble: 6 x 2
  country     `mean(lifeExp)`
  <fct>                 <dbl>
1 Afghanistan            37.5
2 Albania                68.4
3 Algeria                59.0
4 Angola                 37.9
5 Argentina              69.1
6 Australia              74.7


Let’s do another example, again grouping by country. This time, we’ll calculate the mean and standard deviation of the GDP per capita. We’ll also specify the column names inside of the summarise() command.

gapminder_mean_sd <- gapminder %>% 
    group_by(country) %>% 
    summarise(mean_gdp = mean(gdpPercap), sd_gdp = sd(gdpPercap))

head(gapminder_mean_sd)
# A tibble: 6 x 3
  country     mean_gdp sd_gdp
  <fct>          <dbl>  <dbl>
1 Afghanistan     803.   108.
2 Albania        3255.  1192.
3 Algeria        4426.  1310.
4 Angola         3607.  1166.
5 Argentina      8956.  1863.
6 Australia     19981.  7815.


Combining tables with left_join()

Let’s say you have a data frame you’ve generated, containing a list of gene IDs and some corresponding value, such as expression. And you have another table that contains many (e.g. all) human gene IDs, as well as the names for those genes. For example the gene ID “ENSG00000012048” corresponds to the gene BRCA1. Now let’s say you want to map between the two tables using the gene ID, to create a single table with ID-Expression-Name for each gene. That’s what left_join() is for!

This function takes one data frame “x” and using a specified column, looks for matching entries in “y”. Note that the output data frame will contain all rows and columns from “x”, as well as all columns from “y”, but only matching rows from “y”.

First let’s load some simple example data to play with:

fruits1 <- read.csv("../data/fruits_table1.csv", stringsAsFactors = FALSE)
fruits2 <- read.csv("../data/fruits_table2.csv", stringsAsFactors = FALSE)

head(fruits1)
  FruitID Fruit_Exp
1      f1         1
2      f2        -1
3      f3         0
4      f4         4
5      f5        -8
head(fruits2)
  FruitID FruitName
1      f1     apple
2      f2    orange
3      f3    banana
4      f4      pear
5      f5 blueberry
6      f6     melon

Now we can use left_join() to combine the two tables, based on matching values in a specified column. The syntax is as follow:

left_join(fruits1, fruits2, by = "FruitID")
  FruitID Fruit_Exp FruitName
1      f1         1     apple
2      f2        -1    orange
3      f3         0    banana
4      f4         4      pear
5      f5        -8 blueberry


Extra: Note that you can have different column names in each of your data frames, and still join the tables together. The syntax for this is:

left_join(x, y, by = c("columnX" = "columnY"))


Tying it all together

Now let’s use all the commands we’ve covered and combine them with pipes into a single statement.

Let’s say we want calculate the mean and SD of the GDP (in billions) for each country, but only considering data from 1980 and onwards. We can accomplish this all in one step as follows.

# select() columns, filter() by year, calculate GDP in billions, mean() and sd()
# of GDP in billions
gapminder_final <- gapminder %>% 
    select(country, year, pop, gdpPercap) %>% 
    filter(year >= 1980) %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9) %>% 
    group_by(country) %>% 
    summarise(mean_gdpBillion = mean(gdp_billion), sd_gdpBillion = sd(gdp_billion))

head(gapminder_final)
# A tibble: 6 x 3
  country     mean_gdpBillion sd_gdpBillion
  <fct>                 <dbl>         <dbl>
1 Afghanistan            16.4          7.66
2 Albania                13.1          4.84
3 Algeria               149.          33.2 
4 Angola                 28.9         15.5 
5 Argentina             353.          91.5 
6 Australia             478.         154.